Using MySQL with R is pretty easy, with RMySQL. Here are a few notes to keep me straight on a few things I always get snagged on.
Typically, most folks are going to want to analyze data that’s already in a MySQL database. Being a little bass-ackwards, I often want to go the other way. One reason to do this is to do some analysis in R and make the results available dynamically in a web app, which necessitates writing data from R into a database. As of this writing, INSERT isn’t even mentioned in the RMySQL docs, sadly for me, but it works just fine.
The docs are a bit clearer for RS-DBI, which is the standard R interface to relational databases and of which RMySQL is one implementation.
The best way to close DB connections, like you would do in a finally clause in Java, is to use on.exit, like this:
con <- dbConnect(MySQL(), user="me", password="nuts2u", dbname="my_db", host="localhost") on.exit(dbDisconnect(con))
Using sprintf to build the queries feels a little primitive. As far as I can tell, there’s no prepared statements in RMySQL. I don’t suppose SQL-injection is a concern here, but prepared statements might be a little tidier, anyway.
You can process query results row by row, in blocks or all at once. The highly useful function dbGetQuery(con, sql) returns all query results as a data frame. With dbSendQuery, you can get all or partial results with fetch.
con <- dbConnect(MySQL(), user="network_portal", password="monkey2us", dbname=db.name, host="localhost") rs <- dbSendQuery(con, "select name from genes limit 10;") data <- fetch(rs, n=10) huh <- dbHasCompleted(rs) dbClearResult(rs) dbDisconnect(con)
If there’s no more results, fetch returns a data frame with 0 columns and 0 rows. dbHasCompleted is supposed to indicate whether there are more records to be fetched, but seems broken. The value of huh in the code above is false, which seems wrong to me.
A standard newbie question with MySQL is how to retrieve freshly generated primary keys from AUTO_INCREMENT fields. That’s what MySQL’s LAST_INSERT_ID() is for.
The same works with RMySQL, but there are some traps to watch out for. Let’s say you’re inserting a row into a table of networks. Don’t worry about the specifics. You want to insert related data in another table, so you need the ID of the newly inserted row.
create.network <- function(species.id, network.name, data.source, description) { con <- dbConnect(MySQL(), user="super_schmuck", password="nuts2u", dbname="my_db", host="localhost") on.exit(dbDisconnect(con)) sql <- sprintf("insert into networks (species_id, name, data_source, description, created_at) values (%d, '%s', '%s', '%s', NOW());", species.id, network.name, data.source, description) rs <- dbSendQuery(con, sql) dbClearResult(rs) id <- dbGetQuery(con, "select last_insert_id();")[1,1] return(id) }
Don’t forget to clear the result of the insert. If you do, you’ll get 0 from the last_insert_id(). Also, using dbGetQuery for the insert produces an strange error when you go to call last_insert_id:
Error in mysqlExecStatement(conn, statement, ...) : RS-DBI driver: (could not run statement: Commands out of sync; you can't run this command now)
Alternatively, you can also combine both SQL statements into one call to dbSendQuery, but, you have to remember to set a flag when you make the connection: client.flag=CLIENT_MULTI_STATEMENTS. Trying to use multiple queries seems not to work with dbGetQuery.
create.network <- function(species.id, network.name, data.source, description) { con <- dbConnect(MySQL(), user="super_schmuck", password="nuts2u", dbname="my_db", host="localhost", client.flag=CLIENT_MULTI_STATEMENTS) on.exit(dbDisconnect(con)) sql <- sprintf("insert into networks (species_id, name, data_source, description, created_at) values (%d, '%s', '%s', '%s', NOW()); select last_insert_id();", species.id, network.name, data.source, description) rs <- dbSendQuery(con, sql) if (dbMoreResults(con)) { rs <- dbNextResult(con) id <- fetch(rs)[1,1] } else { stop('Error getting last inserted id.') } dbClearResult(rs) return(id) }
Any effort saved by combining the SQL queries is lost in the extra house-keeping so I prefer the first method.
In spite of these few quirks, RMySQL generally works fine and is pretty straightforward.
Connecting to MySQL is made very easy with the RMySQL package. To connect to a MySQL database simply install the package and load the library.
install.packages("RMySQL")
library(RMySQL)
Connecting to MySQL:
Once the RMySQL library is installed create a database connection object.
mydb = dbConnect(MySQL(), user='user', password='password', dbname='database_name', host='host')
Listing Tables and Fields:
Now that a connection has been made we list the tables and fields in the database we connected to.
dbListTables(mydb)
This will return a list of the tables in our connection.
dbListFields(mydb, 'some_table')
This will return a list of the fields in some_table
.
Running Queries:
Queries can be run using the dbSendQuery
function.
dbSendQuery(mydb, 'drop table if exists some_table, some_other_table')
In my experience with this package any SQL query that will run on MySQL will run using this method.
Making tables:
We can create tables in the database using R dataframes.
dbWriteTable(mydb, name='table_name', value=data.frame.name)
Retrieving data from MySQL:
To retrieve data from the database we need to save a results set object.
rs = dbSendQuery(mydb, "select * from some_table")
I believe that the results of this query remain on the MySQL server, to access the results in R we need to use the fetch
function.
data = fetch(rs, n=-1)
This saves the results of the query as a data frame object. The n
in the function specifies the number of records to retrieve, using n=-1
retrieves all pending records.
R and MySQL Database Tutorial
library(RMySQL)
# Hello visitors, this tutorials is to connect and working with mysql server using RMySQl package
# Although, it is bit tricky and difficult to instal this package on lastest version on R, but you should do it before using this tutorial
# If in case, you find it difficult to instal RMySQl on your computer, please send me a message, so that I can develop a tutorials for that as well
library(RMySQL)
# mydb = dbConnect(MySQL(), user='manoj', password='password', host='localhost')
# creating a database using RMySQL in R
# dbSendQuery(mydb, "CREATE DATABASE bookstore;")
# dbSendQuery(mydb, "USE bookstore")
# reconnecting to database we just created using following command in R :
mydb = dbConnect(MySQL(), user='manoj', password='password', host='localhost', dbname="bookstore")
dbSendQuery(mydb, "drop table if exists books, authors")
# creating tables in bookstore:
dbSendQuery(mydb, "
CREATE TABLE books (
book_id INT,
title VARCHAR(50),
author VARCHAR(50));")
# Show table using R:
dbListTables(mydb)
## [1] "books"
# Considering our bookstore a bit more, we realize that we need to add a few more columns for data elements: publisher, publication year, ISBN number, genre (e.g., novel, poetry, drama), description of book, etc.
dbSendQuery(mydb, "ALTER TABLE books
CHANGE COLUMN book_id book_id INT AUTO_INCREMENT PRIMARY KEY,
CHANGE COLUMN author author_id INT,
ADD COLUMN description TEXT,
ADD COLUMN genre ENUM('novel','poetry','drama', 'tutorials', 'text', 'other'),
ADD COLUMN publisher_id INT,
ADD COLUMN pub_year VARCHAR(4),
ADD COLUMN isbn VARCHAR(20);")
# if R gives you an error:
# Error in mysqlExecStatement(conn, statement, ...) :
# RS-DBI driver: (connection with pending rows, close resultSet before continuing)
# reconnect database using:
mydb = dbConnect(MySQL(), user='manoj', password='password', host='localhost', dbname="bookstore")
# and then run the above command of table alteration
# Now, Before moving on to adding data to our books table, let's quickly set up the authors table.
dbSendQuery(mydb, "CREATE TABLE authors
(author_id INT AUTO_INCREMENT PRIMARY KEY,
author_last VARCHAR(50),
author_first VARCHAR(50),
country VARCHAR(50));")
# Adding data into tables
dbSendQuery(mydb, "INSERT INTO authors
(author_last, author_first, country)
VALUES('Kumar','Manoj','India');")
# fetching last data insert id number:
last_id = fetch(dbSendQuery(mydb, "SELECT LAST_INSERT_ID();"))
# Inserting data into books table and using last insert ID number:
dbSendQuery(mydb, "INSERT INTO books
(title, author_id, isbn, genre, pub_year)
VALUES('R and MySQL', 1,'6900690075','tutorials','2014');")
try1 = fetch(dbSendQuery(mydb, "SELECT book_id, title, description
FROM books
WHERE genre = 'tutorials';"))
# That's all for now... will come back soon with some advanced tutorials on R and MySQL